AITopics

Country: North America > United States (1.00)

Genre: Research Report (0.46)

Industry:

Information Technology (1.00)
Government > Regional Government > North America Government > United States Government (0.67)
Semiconductors & Electronics (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
(3 more...)

Neural Information Processing SystemsFeb-17-2026, 23:23:09 GMT

f498c1ce6bff52eb04febf87438dd84b-Paper-Conference.pdf

large language model, machine learning, natural language, (20 more...)

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Texas (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.46)

Industry:

Information Technology (1.00)
Government > Regional Government > North America Government > United States Government (0.67)
Semiconductors & Electronics (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
(3 more...)

Neural Information Processing SystemsDec-27-2025, 05:28:30 GMT

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance.

monarch mixer, sequence length and model dimension, simple sub-quadratic gemm-based architecture, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.75)

arXiv.org Artificial IntelligenceOct-28-2025

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Yaras, Can, Xu, Alec S., Abillama, Pierre, Lee, Changwoo, Balzano, Laura

Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $Θ(N\sqrt{N} d)$ computational complexity and $Θ(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the Transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.

artificial intelligence, machine learning, natural language, (17 more...)

2505.18698

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

de Lima, João Paulo Cardoso, Dietrich, Marc, Castrillon, Jeronimo, Khan, Asif Ali

Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs

arXiv.org Artificial IntelligenceOct-14-2025

Structured sparsity enables deploying large language models (LLMs) on resource-constrained systems. Approaches like dense-to-sparse fine-tuning are particularly compelling, achieving remarkable structured sparsity by reducing the model size by over 6.7x, while still maintaining acceptable accuracy. Despite this reduction, LLM inference, especially the decode stage being inherently memory-bound, is extremely expensive on conventional Von-Neumann architectures. Compute-in-memory (CIM) architectures mitigate this by performing computations directly in memory, and when paired with sparse LLMs, enable storing and computing the entire model in memory, eliminating the data movement on the off-chip bus and improving efficiency. Nonetheless, naively mapping sparse matrices onto CIM arrays leads to poor array utilization and diminished computational efficiency. In this paper, we present an automated framework with novel mapping and scheduling strategies to accelerate sparse LLM inference on CIM accelerators. By exploiting block-diagonal sparsity, our approach improves CIM array utilization by over 50%, achieving more than 4x reduction in both memory footprint and the number of required floating-point operations.

artificial intelligence, large language model, natural language, (17 more...)

2510.11192

Country: Europe > Germany (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Machine LearningJun-17-2025

Scaling Probabilistic Circuits via Monarch Matrices

Zhang, Honghua, Dang, Meihua, Wang, Benjie, Ermon, Stefano, Peng, Nanyun, Broeck, Guy Van den

Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs either by leveraging their sparse properties or through the use of tensorized operations for better hardware utilization. However, no existing method fully exploits both aspects simultaneously. In this paper, we propose a novel sparse and structured parameterization for the sum blocks in PCs. By replacing dense matrices with sparse Monarch matrices, we significantly reduce the memory and computation costs, enabling unprecedented scaling of PCs. From a theory perspective, our construction arises naturally from circuit multiplication; from a practical perspective, compared to previous efforts on scaling up tractable probabilistic models, our approach not only achieves state-of-the-art generative modeling performance on challenging benchmarks like Text8, LM1B and ImageNet, but also demonstrates superior scaling behavior, achieving the same performance with substantially less compute as measured by the number of floating-point operations (FLOPs) during training.

artificial intelligence, machine learning, matrix, (13 more...)

arXiv.org Machine Learning

2506.12383

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > Canada (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsJan-20-2025, 02:37:44 GMT

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1 \times higher throughput at sequence length 4K.

monarch mixer, sequence length and model dimension, simple sub-quadratic gemm-based architecture, (4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.77)

arXiv.org Artificial IntelligenceAug-30-2024

MoRe Fine-Tuning with 10x Fewer Parameters

Tan, Wenxuan, Roberts, Nicholas, Huang, Tzu-Heng, Zhao, Jitian, Cooper, John, Guo, Samuel, Duan, Chengyu, Sala, Frederic

Parameter-efficient fine-tuning (PEFT) techniques have unlocked the potential to cheaply and easily specialize large pretrained models. However, the most prominent approaches, like low-rank adapters (LoRA), depend on heuristics or rules-of-thumb for their architectural choices -- potentially limiting their performance for new models and architectures. This limitation suggests that techniques from neural architecture search could be used to obtain optimal adapter architectures, but these are often expensive and difficult to implement. We address this challenge with Monarch Rectangular Fine-tuning (MoRe), a simple framework to search over adapter architectures that relies on the Monarch matrix class. Theoretically, we show that MoRe is more expressive than LoRA. Empirically, our approach is more parameter-efficient and performant than state-of-the-art PEFTs on a range of tasks and models, with as few as 5\% of LoRA's parameters.

fine-tuning, matrix, monarch matrix, (15 more...)

2408.17383

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
Europe > Austria > Vienna (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)

arXiv.org Artificial IntelligenceJun-14-2024

Group and Shuffle: Efficient Structured Orthogonal Parametrization

Gorbunov, Mikhail, Yudin, Nikolay, Soboleva, Vera, Alanov, Aibek, Naumov, Alexey, Rakhuba, Maxim

The increasing size of neural networks has led to a growing demand for methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including adapting of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.

artificial intelligence, machine learning, natural language, (20 more...)

2406.10019

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Hayase, Tomohiro, Karakida, Ryo

MLP-Mixer as a Wide and Sparse MLP

arXiv.org Artificial IntelligenceJun-2-2023

Multi-layer perceptron (MLP) is a fundamental component of deep learning that has been extensively employed for various problems. However, recent empirical successes in MLP-based architectures, particularly the progress of the MLP-Mixer, have revealed that there is still hidden potential in improving MLPs to achieve better performance. In this study, we reveal that the MLP-Mixer works effectively as a wide MLP with certain sparse weights. Initially, we clarify that the mixing layer of the Mixer has an effective expression as a wider MLP whose weights are sparse and represented by the Kronecker product. This expression naturally defines a permuted-Kronecker (PK) family, which can be regarded as a general class of mixing layers and is also regarded as an approximation of Monarch matrices. Subsequently, because the PK family effectively constitutes a wide MLP with sparse weights, one can apply the hypothesis proposed by Golubeva, Neyshabur and Gur-Ari (2021) that the prediction performance improves as the width (sparsity) increases when the number of weights is fixed. We empirically verify this hypothesis by maximizing the effective width of the MLP-Mixer, which enables us to determine the appropriate size of the mixing layers quantitatively.

artificial intelligence, machine learning, mlp-mixer, (17 more...)

2306.0147

Country: Asia > Japan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)